Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables
Abstract
In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. As an example document set, we use a subset of the Wikileaks Cables, made up of documents relating to five key news items revealed by the cables. To sanitize the documents we have developed a semi-automatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. The process (i) identifies and anonymizes specific person names and data, and (ii) applies concept generalization based on WordNet categories to identify words categorized as classified. Finally, we manually revise the text from a contextual point of view to eliminate complete sentences, paragraphs and sections where necessary. We show that significant sanitization can be applied while maintaining the relevance of the documents to the queries corresponding to the five key news items.
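The two automatic steps described in the abstract, together with a before-and-after relevance check, might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the name list, the `HYPERNYMS` dictionary (a toy stand-in for WordNet categories), and the `query_overlap` score (a crude proxy for the information retrieval metrics) are all assumptions introduced here.

```python
import re

# Hypothetical inputs: the paper relies on WordNet categories, which this
# small hand-written dictionary merely stands in for.
PERSON_NAMES = ["John Smith", "Jane Doe"]
HYPERNYMS = {"missile": "weapon", "uranium": "material", "embassy": "building"}


def anonymize_names(text, names=PERSON_NAMES):
    """Step (i): replace each listed person name with a numbered placeholder."""
    for i, name in enumerate(names, start=1):
        text = re.sub(re.escape(name), f"PERSON_{i}", text)
    return text


def generalize_concepts(text, hypernyms=HYPERNYMS):
    """Step (ii): replace each sensitive word with a more general category."""
    def swap(match):
        word = match.group(0)
        return hypernyms.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def sanitize(text):
    """Apply name anonymization first, then concept generalization."""
    return generalize_concepts(anonymize_names(text))


def query_overlap(query, text):
    """Fraction of query terms found in the text (assumed relevance proxy)."""
    terms = set(re.findall(r"[a-z0-9_]+", query.lower()))
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return len(terms & words) / len(terms) if terms else 0.0


cable = "John Smith discussed the missile shipment at the embassy."
clean = sanitize(cable)
print(clean)
print(query_overlap("missile shipment", cable), query_overlap("missile shipment", clean))
print(query_overlap("John Smith", clean))
```

Under these assumptions, a drop in the score for a topical query such as "missile shipment" reflects information loss, while the score for a query on a protected name reflects remaining risk of disclosure.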
Similar resources
An Information Retrieval Approach to Document Sanitization
In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. In order to sanitize the documents we have developed a semiautomatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. It embodies two main steps: (i) identifying and anonymizing sp...
An information retrieval approach to document sanitization
In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. In order to sanitize the documents we have developed a semiautomatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. It embodies two main and independent steps: (i) identifying an...
Detecting Term Relationships to Improve Textual Document Sanitization
Nowadays, the publication of textual documents provides critical benefits to scientific research and business scenarios where information analysis plays an essential role. Nevertheless, the possible existence of identifying or confidential data in this kind of document motivates the use of measures to sanitize sensitive information before it is published, while keeping the innocuous data unmod...
Utility-preserving sanitization of semantically correlated terms in textual documents
Traditionally, redaction has been the method chosen to mitigate the privacy issues related to the declassification of textual documents containing sensitive data. This process is based on removing sensitive words in the documents prior to their release and has the undesired side effect of severely reducing the utility of the content. Document sanitization is a recent alternative to redaction, w...
Automatic Declassification of Textual Documents by Generalizing Sensitive Terms
With the advent of the internet, large numbers of text documents are published and shared every day. Each of these documents contains a vast amount of information. Publicly sharing some of this information may compromise privacy if the information is confidential. So before document publishing, sanitization operations are performed on the document for preserving the...
Publication date: 2012